2016 Election Analysis with a focus on machine learning

by Ruben Olmos

In [38]:
library(dplyr)
library(readr)
library(kableExtra)
library(ggplot2)
library(maps)
library(tidyr)
library(tidyverse)  
library(cluster)    
library(factoextra) 
library(dendextend) 
library(ROCR)
library(glmnet)
library(caret)
library(tree)
library(maptree)
library(ggridges)
library(viridis)
library(hrbrthemes)
library(IRdisplay)

Introduction and Background

Predicting voter behavior is a difficult task: it depends on many variables that can change quickly (so-called "chaos"), skewing your results. Polls are essential to predicting voter behavior, but only if they are conducted correctly, with random sampling and large sample sizes. Moreover, polls are at risk of bias from low response rates and inaccurate polling methods, and they do not measure voter turnout. Voter behavior prediction is also hard because, by the time an analysis is complete, the variables and their relationships to one another may already have changed, which can lead to a completely different result than expected and mislead the audience. This is especially crucial for campaign leaders, whose time is extremely valuable and who need to know which areas to focus on. Lastly, the infrequency of elections makes it difficult for statisticians to practice and refine models that accurately predict elections. With elections occurring every 4 years and only about one year of strong campaigning, it is hard to collect data that accurately portrays how election night will go.

However, despite all the complications of predicting election results, some methods produced predictions that fell very close to the actual outcomes. In particular, the methods used by Nate Silver provided predictions very close to the actual results: Silver modeled voting behavior and correctly predicted the election result in all 50 states. He used hierarchical modeling to analyze state-level and national-level voting together, modeling intended voting behavior in each state as an observed variable that he then used to predict the voting results. What made his model stand out was his treatment of poll percentages, where poll percentage = actual percentage + house effect + sampling variation; he used the principle that "more variation is less likely" to determine which actual percentage had the greatest probability. Rather than simply taking the maximum-probability estimate, Silver calculated the probability of each level of support for each day and then used the model for the next day to ask: what is the probability that support changed from one percentage to another? He achieved this by using Bayes' Theorem and graph theory to compute updated probabilities for each level of support. With these probabilities, he incorporated hierarchical modeling to produce predictions of estimated support at the state and national level on any day leading up to the election.
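The Bayesian updating idea can be sketched in a few lines of R. This is a minimal illustration of Bayes' Theorem applied to discrete levels of support, not Silver's actual model; the grid of support levels, the flat prior, the poll value, and the polling-error standard deviation are all made-up assumptions.

```r
# Illustrative sketch of a discrete Bayesian update (NOT Silver's actual model).
# Assumed: support levels on a 1-point grid, a flat prior, one poll at 52%,
# and normally distributed polling error with sd = 3 percentage points.
support_levels <- seq(40, 60, by = 1)
prior <- rep(1 / length(support_levels), length(support_levels))

poll <- 52
likelihood <- dnorm(poll, mean = support_levels, sd = 3)

# Bayes' Theorem: the posterior is proportional to prior times likelihood.
posterior <- prior * likelihood
posterior <- posterior / sum(posterior)

# Most probable level of support after seeing the poll:
support_levels[which.max(posterior)]
```

In a day-to-day version of this idea, today's posterior becomes tomorrow's prior, with larger overnight swings in support assigned lower probability.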

So what went wrong with the 2016 predictions that placed Clinton as a clear favorite? The inaccurate predictions can be attributed to poor polling, misinterpretation and mediocre analysis, and weak election models. In 2016, the state polling that models and forecasts relied on was inadequate; because of this, we saw a sweep of national polls missing in the same direction. Weak election models are also believed to be partly to blame, as small forecasting errors among so many variables can produce large changes in the result. To ensure better predictions in the future, I believe polling must be a top priority: collecting data correctly with random sampling and large sample sizes, paying extra attention to swing states to make sure those models are correct, and developing methods to predict voter turnout as well as encouraging adequate turnout on election day. Analysis of the weights of certain factors, such as whether a candidate is an incumbent, would also help statisticians develop models that accurately portray changes in voter behavior.

In [39]:
election.raw <- read_delim("election.csv", delim = ",") %>% mutate(candidate=as.factor(candidate))


census_meta <- read_delim("metadata.csv", delim = ";", col_names = FALSE) 
census <- read_delim("census.csv", delim = ",")
Parsed with column specification:
cols(
  county = col_character(),
  fips = col_character(),
  candidate = col_character(),
  state = col_character(),
  votes = col_double()
)

Parsed with column specification:
cols(
  X1 = col_character(),
  X2 = col_character(),
  X3 = col_character()
)

Parsed with column specification:
cols(
  .default = col_double(),
  State = col_character(),
  County = col_character()
)

See spec(...) for full column specifications.

Description of Data

The data that we are analyzing is such that each observation consists of five main characteristics:

1. county: the name of the county where the voting was done
2. fips: the Federal Information Processing Standards (FIPS 6-4) code, a five-digit code that uniquely identifies counties. The five-digit code uses the two digits of the FIPS state code followed by the three digits of the county code within the state.
3. candidate: the name of the candidate that was voted for in that particular county
4. state: the abbreviation of the state where the county is located and where the voting took place
5. votes: the number of votes that a candidate received in that county
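As a quick sketch of how a FIPS code decomposes into its state and county parts (illustrative only; note that in our data the leading zero is dropped, so Los Angeles County appears as 6037 rather than 06037):

```r
# The first two digits of a FIPS code identify the state, the last three the
# county. FIPS codes are kept as text because leading zeros are significant.
fips <- "06037"                      # Los Angeles County, CA
state_part  <- substr(fips, 1, 2)    # "06"  = California
county_part <- substr(fips, 3, 5)    # "037" = Los Angeles County
```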

Each observation is expressed as a row in a table and each characteristic is set as a column. We can glimpse the data by taking, for example, Los Angeles County.

In [40]:
#kable(election.raw %>% filter(county == "Los Angeles County"))  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE)
election.raw %>% filter(county == "Los Angeles County")%>% kable('html')%>% as.character()%>%
display_html()
county fips candidate state votes
Los Angeles County 6037 Hillary Clinton CA 2464364
Los Angeles County 6037 Donald Trump CA 769743
Los Angeles County 6037 Gary Johnson CA 88968
Los Angeles County 6037 Jill Stein CA 76465
Los Angeles County 6037 Gloria La Riva CA 21993

We can observe the number of votes that Los Angeles County cast for each presidential candidate, with the candidates identified by name, the votes expressed as integers, the state expressed as an abbreviation, and the county identified by FIPS code. Further, there are some rows in our data that are summary rows; these rows contain NA values for the county column. There are two types of summary rows:
1. Federal-level summary rows have a fips value of US.
2. State-level summary rows have the name of each state as the fips value.

In [41]:
#kable(head(election.raw %>% filter(is.na(county)==TRUE),10))  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE)

#kable(tail(election.raw %>% filter(is.na(county)==TRUE),10))  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE)

head(election.raw %>% filter(is.na(county)==TRUE),10)%>% kable('html')%>% as.character()%>%
display_html()
tail(election.raw %>% filter(is.na(county)==TRUE),10)%>% kable('html')%>% as.character()%>%
display_html()
county fips candidate state votes
NA US Donald Trump US 62984825
NA US Hillary Clinton US 65853516
NA US Gary Johnson US 4489221
NA US Jill Stein US 1429596
NA US Evan McMullin US 510002
NA US Darrell Castle US 186545
NA US Gloria La Riva US 74117
NA US Rocky De La Fuente US 33010
NA US None of these candidates US 28863
NA US Richard Duncan US 24235
county fips candidate state votes
NA VT Gary Johnson VT 10078
NA VT Jill Stein VT 6758
NA VT Rocky De La Fuente VT 1063
NA VT Gloria La Riva VT 327
NA WY Donald Trump WY 174419
NA WY Hillary Clinton WY 55973
NA WY Gary Johnson WY 13287
NA WY Jill Stein WY 2515
NA WY Darrell Castle WY 2042
NA WY Rocky De La Fuente WY 709

Above we can see a glimpse of the summary observations: for a state-level summary we have an NA value for county and the fips column contains the state abbreviation, and for a federal summary we have an NA value for county and the fips column contains US. That is, if an observation is a summary it has an NA value for the county and a non-numerical value for the fips code.

Cleaning the data

Before proceeding with our analysis of the data, it is important to deal with any occurrence of faulty or flawed data. A further inspection of the data reveals a problem: for rows whose fips value is 2000, the respective county is not available; in its place is a value of NA.

In [42]:
#kable(election.raw %>% filter(fips==2000))  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE)
election.raw %>% filter(fips==2000)%>% kable('html')%>% as.character()%>%
display_html()
county fips candidate state votes
NA 2000 Donald Trump AK 163387
NA 2000 Hillary Clinton AK 116454
NA 2000 Gary Johnson AK 18725
NA 2000 Jill Stein AK 5735
NA 2000 Darrell Castle AK 3866
NA 2000 Rocky De La Fuente AK 1240

At first it would seem reasonable to look up the county that corresponds to the fips code 2000, but doing so reveals that this code does not exist. It would then seem reasonable that, with enough information on the state of AK, we could deduce to which county these values pertain, so we inspect the available data for AK:

In [43]:
#kable(election.raw %>% filter(state=='AK'))  %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"), full_width=FALSE)
election.raw %>% filter(state=='AK')%>% kable('html')%>% as.character()%>%
display_html()
county fips candidate state votes
NA AK Donald Trump AK 163387
NA AK Hillary Clinton AK 116454
NA AK Gary Johnson AK 18725
NA AK Jill Stein AK 5735
NA AK Darrell Castle AK 3866
NA AK Rocky De La Fuente AK 1240
NA 2000 Donald Trump AK 163387
NA 2000 Hillary Clinton AK 116454
NA 2000 Gary Johnson AK 18725
NA 2000 Jill Stein AK 5735
NA 2000 Darrell Castle AK 3866
NA 2000 Rocky De La Fuente AK 1240

We observe that the only available data we have for Alaska is the state summary and the data where the county code is 2000. With this information we cannot reasonably deduce to which county this data belongs. Therefore the best practice is to remove the data where the county code is 2000.

In [44]:
dim(election.raw)
  1. 18351
  2. 5

The dimensions of our data before removing are 18351 rows and 5 columns.

In [45]:
election.raw<-election.raw[!(election.raw$fips==2000),]
dim(election.raw)
  1. 18345
  2. 5

After removing the observations where the county code is 2000 the dimensions for the data are 18345 rows and 5 columns.

For the purposes of our analysis it is practical to remove the federal- and state-level summaries of election voting, as these would interfere with the analysis that we want to conduct. Our current dimensions, inclusive of state and federal summaries, are 18345 rows and 5 columns.

In [46]:
election_federal <- election.raw %>% filter(fips == 'US' & is.na(county))

election_state <- election.raw %>% filter(fips != "US" & is.na(county))

election <- election.raw %>% filter(fips != "US" & as.character(fips) != as.character(state))

election.raw<-election.raw[!(is.na(election.raw$county)==TRUE),]
dim(election.raw)
  1. 18011
  2. 5

After removing all summaries (federal and state), the dimensions of our data are now 18011 rows and 5 columns.

Initial visual analysis of votes per candidate

In the 2016 election there were 31 candidates for the presidency. We can run an initial visualization of the votes received by each of the candidates to get a sense of the outcome of the election and the distribution of votes.

In [47]:
options(repr.plot.width = 30, repr.plot.height = 30, repr.plot.res = 100)

ggplot(election_federal, aes(x = candidate, y = votes/1000000)) + 
  geom_bar(stat = "identity") + coord_flip() + scale_y_continuous() + ylab("Votes (in millions)")+
theme(text = element_text(size = 30), element_line(size = .1))

Creating columns for winning candidate by county and by state

For simplicity of analysis we create county_winner and state_winner by taking the candidate with the highest proportion of votes. To create county_winner we start with election, group by fips, compute total votes, and create the variable pct = votes/total. We then choose the highest row using top_n. We follow a similar approach for state_winner, arriving at an aggregated column containing the winner for every state.

In [48]:
county_group <- group_by(election, fips)
county_total <- summarize(county_group, total = sum(votes)) 
county.group <- left_join(county_group, county_total, by = "fips") 
county_pct <- mutate(county.group, pct = votes/total) 
county_winner <- top_n(county_pct, n =1)

state_group <- group_by(election_state, state)
state_total <- summarize(state_group, total = sum(votes)) 
state.group <- left_join(state_group, state_total, by = "state") 
state_pct <- mutate(state.group, pct = votes/total)
state_winner <- top_n(state_pct, n= 1)
`summarise()` ungrouping output (override with `.groups` argument)

Selecting by pct

`summarise()` ungrouping output (override with `.groups` argument)

Selecting by pct

Visualizing State and County Election Outcomes on a U.S. Map

Visualization is crucial for gaining insight and intuition during data mining. We will map our data onto a U.S. map in order to have visual representations of the results at the county and state levels.

The R package ggplot2 can be used to draw maps and we make use of it to create a map visualization of our data.

Map data contains the information needed to draw polygons, with fill colors determined by region. We create an initial visualization of the U.S. counties, colored by county, using counties = map_data("county").

In [49]:
counties=map_data("county")
options(repr.plot.width = 30, repr.plot.height = 30, repr.plot.res = 100)


ggplot(data = counties) + 
  geom_polygon(aes(x = long, y = lat, fill = subregion, group = group), color = "white") + 
  coord_fixed(1.3) +
  guides(fill=FALSE)+  # color legend is unnecessary and takes too long
  theme(text = element_text(size = 30), element_line(size = .1))

Visualizing the results by state

Now we color the map by the winning candidate for each state. We'll combine the two datasets based on state name. However, the state names are in different formats in the two tables, e.g. AZ vs. arizona. Before using left_join(), we create a common column by adding a new column to states, fips=state.abb[match(states$region, casefold(state.name))]. We then combine the states variable and the state_winner we created earlier using left_join(). A call to left_join() takes all the values from the first table and looks for matches in the second table. If it finds a match, it adds the data from the second table; if not, it adds missing values. After doing this we visualize our results:

In [50]:
states <- map_data("state")
states <- states %>% 
  mutate(fips=state.abb[match(states$region, casefold(state.name) )]) 
states <- left_join(states, state_winner, by="fips")

options(repr.plot.width = 30, repr.plot.height = 30, repr.plot.res = 100)

ggplot(data = states) +
  geom_polygon(aes(x = long, y = lat, fill = candidate, group = group),colour="white") + 
  coord_fixed(1.3)+ 
  guides(fill=guide_legend(title="Candidate"))+theme(text = element_text(size = 30), element_line(size = .1))

Red signifies that Donald Trump won the state and blue signifies that Hillary Clinton won.

Visualizing the results by county

The variable county does not have a fips column, so we will create one by pooling information from maps::county.fips. We split the polyname column into region and subregion, use left_join() to combine county.fips into county, and left_join() the previously created variable county_winner. Finally we visualize the election results at the county level, colored by the winner of each county.

In [51]:
oldw <- getOption("warn")
options(warn = -1)


county_split=separate(maps::county.fips,polyname,c("region", "subregion"),sep="," ) 

county_combined=left_join(county_split,counties,by=c("region", "subregion")) 
county_combined$fips=as.factor(county_combined$fips) 
county_new=left_join(county_combined,county_winner)

options(repr.plot.width = 30, repr.plot.height = 30, repr.plot.res = 100)
ggplot(data = county_new) +
  geom_polygon(aes(x = long, y = lat, fill = county_new$candidate, group = group),colour="white" ) + 
  coord_fixed(1.3)+
  guides(fill=guide_legend(title="Candidate"))+theme(text = element_text(size = 30), element_line(size = .1))
options(warn = oldw)
Joining, by = "fips"

Red signifies that Donald Trump won the county and blue signifies that Hillary Clinton won.

Further Visual Analysis into demographics of interest

It would be interesting to visualize the poverty percentage by state, given that the United States holds very varied views on socioeconomic systems that are often thought to be linked to the economic status of groups. We can then visualize the percentage of poverty for each state.

In [52]:
options(repr.plot.width = 30, repr.plot.height = 30, repr.plot.res = 100)

ggplot(data = census, aes(y = as.factor(State), x = Poverty, fill = ..x..)) + 
    geom_density_ridges_gradient(scale = 4, rel_min_height = 0.01) +
    scale_fill_viridis_c(name = "Poverty Percentage", option = "C") +
    ggtitle('Poverty Percentage by State') + 
    ylab("State") + xlab("Poverty Percentage")+
theme(text = element_text(size = 30), element_line(size = .1))
Picking joint bandwidth of 2.29

This ridgeplot illustrates the poverty percentage of each state, with most states hovering around the range of 8%-12%.

It would also be interesting to visualize the percentage of the Black population in each state, given that social issues surrounding the Black community have gained greater political traction and that these issues are commonly associated with social ideas on which political ideologies differ greatly.

In [53]:
options(repr.plot.width = 30, repr.plot.height = 30, repr.plot.res = 100)

ggplot(data = census, aes(y = as.factor(State), x = Black, fill = ..x..)) + 
    geom_density_ridges_gradient(scale = 4, rel_min_height = 0.01) +
    scale_fill_viridis_c(name = "Black Population Percentage", option = "C") +
    ggtitle('Black Population Percentage by State') + 
    ylab("State") + xlab("Black Population Percentage")+theme(text = element_text(size = 30), element_line(size = .1))
Picking joint bandwidth of 2.12

This plot shows the Black population percentage per state; the vast majority of states have a very low percentage, under 5%.

It would also be interesting to see a visual representation of the White population, given that this is the predominant population in the United States and that a great deal of the election results are attributed to it, despite the amount of variance present in this demographic.

In [54]:
census1=census %>% group_by(State) %>% summarise(White_Pct=mean(White,na.rm=TRUE))
options(repr.plot.width = 30, repr.plot.height = 30, repr.plot.res = 100)
ggplot(census1,aes(x=State,y=White_Pct)) + 
   geom_bar(stat="identity")+
   theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
   ggtitle("Percentage of White population in each State") + 
   ylab("White Population Percentage")+theme(text = element_text(size = 50), element_line(size = .1))
`summarise()` ungrouping output (override with `.groups` argument)

We can observe that in 46 states, the White population comprises the majority population with regard to race/ethnicity.

Cleaning and preparing Census data for use

The census data contains high-resolution information (more fine-grained than county level). We will aggregate the information into county-level data by computing the TotalPop-weighted average of each attribute for each county. We accomplish this by first cleaning the census data: we create census.del from census by filtering out any rows with missing values, convert the {Men, Employed, Citizen} attributes to percentages (the metadata seems to be inaccurate), compute a Minority attribute by combining {Hispanic, Black, Native, Asian, Pacific}, and, after creating Minority, remove {Walk, PublicWork, Construction}.

Now we create census.subct: starting with census.del from above, we group_by() the two attributes {State, County}, use add_tally() to compute CountyTotal, and compute the weight TotalPop/CountyTotal.

Finally we create census.ct: starting with census.subct, we use summarize_at() to compute the weighted mean of each attribute.

In [55]:
census.del <- na.omit(census) %>% mutate(Men = Men/TotalPop*100,
              Employed = Employed/TotalPop*100,
              Citizen = Citizen/TotalPop*100,
              Minority = Hispanic+Black+Native+Asian+Pacific) %>%
select(-Women, -Hispanic, -Native, -Black, -Asian, -Pacific, -Construction,-Walk, -PublicWork) #census.del <- census.del[,c(1:6,29,7:28)] # reordering (want minority next to white)

census.subct<-census.del%>%
  group_by(State,County) %>%
  add_tally(TotalPop) %>%
  mutate(CountyTotal=n)%>%
  mutate(Weight=TotalPop/CountyTotal) %>%
  select(-n)


census.ct <- census.subct %>% summarize_at(vars(  Men, White, Citizen, Income, IncomeErr, IncomePerCap,   IncomePerCapErr, Poverty, ChildPoverty, Professional, Service, Office,  Production, Drive, Carpool, Transit, OtherTransp, WorkAtHome, MeanCommute, Employed, PrivateWork, SelfEmployed, FamilyWork, Unemployment, Minority, CountyTotal),  funs(weighted.mean(., Weight)))

Observing the first few rows of census.ct this is what our data looks like.

In [56]:
head(census.ct[,1:7])%>% kable('html')%>% as.character()%>%
display_html()
head(census.ct[,8:14])%>% kable('html')%>% as.character()%>%
display_html()
head(census.ct[,15:21])%>% kable('html')%>% as.character()%>%
display_html()
head(census.ct[,22:27])%>% kable('html')%>% as.character()%>%
display_html()
State County Men White Citizen Income IncomeErr
Alabama Autauga 48.43266 75.78823 73.74912 51696.29 7771.009
Alabama Baldwin 48.84866 83.10262 75.69406 51074.36 8745.050
Alabama Barbour 53.82816 46.23159 76.91222 32959.30 6031.065
Alabama Bibb 53.41090 74.49989 77.39781 38886.63 5662.358
Alabama Blount 49.40565 87.85385 73.37550 46237.97 8695.786
Alabama Bullock 53.00618 22.19918 75.45420 33292.69 9000.345
IncomePerCap IncomePerCapErr Poverty ChildPoverty Professional Service Office
24974.50 3433.674 12.91231 18.70758 32.79097 17.17044 24.28243
27316.84 3803.718 13.42423 19.48431 32.72994 17.95092 27.10439
16824.22 2430.189 26.50563 43.55962 26.12404 16.46343 23.27878
18430.99 3073.599 16.60375 27.19708 21.59010 17.95545 17.46731
20532.27 2052.055 16.72152 26.85738 28.52930 13.94252 23.83692
17579.57 3110.645 24.50260 37.29116 19.55253 14.92420 20.17051
Production Drive Carpool Transit OtherTransp WorkAtHome MeanCommute
17.15713 87.50624 8.781235 0.0952590 1.3059687 1.8356531 26.50016
11.32186 84.59861 8.959078 0.1266209 1.4438000 3.8504774 26.32218
23.31741 83.33021 11.056609 0.4954032 1.6217251 1.5019456 24.51828
23.74415 83.43488 13.153641 0.5031366 1.5620952 0.7314679 28.71439
20.10413 84.85031 11.279222 0.3626321 0.4199411 2.2654133 34.84489
25.73547 74.77277 14.839127 0.7732160 1.8238247 3.0998783 28.63106
Employed PrivateWork SelfEmployed FamilyWork Unemployment Minority
43.43637 73.73649 5.433254 0.0000000 7.733726 22.53687
44.05113 81.28266 5.909353 0.3633269 7.589820 15.21426
31.92113 71.59426 7.149837 0.0897742 17.525557 51.94382
36.69262 76.74385 6.637936 0.3941515 8.163104 24.16597
38.44914 81.82671 4.228716 0.3564928 7.699640 10.59474
36.19592 79.09065 5.273684 0.0000000 17.890026 76.53587

Dimensionality reduction

We would like to run principal component analysis in order to reduce the dimensionality of our data to a basis of attributes that are most likely to characterize and distinguish observations through the variance of the attributes. We begin by looking at the variance of our data at the county level and at the sub-county level before running principal component analysis.

In [57]:
census_var<-data.frame(apply(census.ct[sapply(census.ct,is.numeric)] , 2, var))
subcensus_var<-data.frame(apply(census.subct[sapply(census.subct,is.numeric)],2,var))
names(census_var)[1]<-"County Variance"
names(subcensus_var)[1]<-"Sub-County Variance"
census_var
subcensus_var
A data.frame: 26 × 1
County Variance
<dbl>
Men              5.416023e+00
White            5.264985e+02
Citizen          2.782521e+01
Income           1.871915e+08
IncomeErr        5.807218e+06
IncomePerCap     3.804072e+07
IncomePerCapErr  1.210144e+06
Poverty          6.931216e+01
ChildPoverty     1.317339e+02
Professional     3.993572e+01
Service          1.360058e+01
Office           1.022755e+01
Production       3.300854e+01
Drive            5.737029e+01
Carpool          8.508591e+00
Transit          9.442292e+00
OtherTransp      2.920815e+00
WorkAtHome       9.714816e+00
MeanCommute      3.115353e+01
Employed         4.796921e+01
PrivateWork      6.274646e+01
SelfEmployed     1.531612e+01
FamilyWork       1.983356e-01
Unemployment     1.707899e+01
Minority         5.265263e+02
CountyTotal      1.010528e+11
A data.frame: 28 × 1
Sub-County Variance
<dbl>
TotalPop         4.359715e+06
Men              1.636172e+01
White            9.409886e+02
Citizen          1.129972e+02
Income           8.216231e+08
IncomeErr        3.488153e+07
IncomePerCap     2.199818e+08
IncomePerCapErr  8.520643e+06
Poverty          1.676799e+02
ChildPoverty     3.651913e+02
Professional     2.222947e+02
Service          6.618685e+01
Office           3.269766e+01
Production       5.731659e+01
Drive            2.261628e+02
Carpool          2.720594e+01
Transit          1.364397e+02
OtherTransp      6.305265e+00
WorkAtHome       1.335178e+01
MeanCommute      4.809799e+01
Employed         8.211358e+01
PrivateWork      6.618434e+01
SelfEmployed     1.474701e+01
FamilyWork       2.090834e-01
Unemployment     3.361937e+01
Minority         9.335587e+02
CountyTotal      3.711516e+12
Weight           9.334002e-03

We can then move on to observing the means at the sub-county and county levels.

In [58]:
meanofcensus.ct<-data.frame(apply(as.matrix(census.ct[, 3:27]), 2, mean))
meanofcensus.subct<-data.frame(apply(as.matrix(census.subct[, 3:27]), 2, mean))
names(meanofcensus.ct)[1]<-"County Means"
names(meanofcensus.subct)[1]<-"Sub-County Means"
meanofcensus.ct
meanofcensus.subct
A data.frame: 25 × 1
County Means
<dbl>
Men              4.995373e+01
White            7.547990e+01
Citizen          7.467507e+01
Income           4.722170e+04
IncomeErr        7.081758e+03
IncomePerCap     2.400075e+04
IncomePerCapErr  3.120460e+03
Poverty          1.757066e+01
ChildPoverty     2.383369e+01
Professional     3.080041e+01
Service          1.847286e+01
Office           2.218309e+01
Production       1.580602e+01
Drive            7.914627e+01
Carpool          1.031949e+01
Transit          9.897334e-01
OtherTransp      1.625164e+00
WorkAtHome       4.605074e+00
MeanCommute      2.331344e+01
Employed         4.302471e+01
PrivateWork      7.418303e+01
SelfEmployed     7.916454e+00
FamilyWork       2.873302e-01
Unemployment     8.157064e+00
Minority         2.266877e+01
A data.frame: 25 × 1
Sub-County Means
<dbl>
TotalPop 4383.554168
Men 49.104724
White 62.064004
Citizen 71.280147
Income 57259.136318
IncomeErr 9132.597261
IncomePerCap 28520.769425
IncomePerCapErr 3925.600644
Poverty 16.887969
ChildPoverty 22.469815
Professional 34.799592
Service 19.077035
Office 23.926241
Production 12.884780
Drive 75.711413
Carpool 9.637483
Transit 5.440670
OtherTransp 1.880372
WorkAtHome 4.322362
MeanCommute 25.690046
Employed 45.664335
PrivateWork 79.006662
SelfEmployed 6.223640
FamilyWork 0.170106
Unemployment 9.003846

We can observe that the variances of the variables are not all the same; some are orders of magnitude larger than the rest. If we proceed with dimensionality reduction without scaling the variables, most of the principal components we observe would be driven by the variables with the largest variances. It is therefore a good idea to scale the variables before running principal component analysis. Further, centering the data before running PCA helps us avoid a misleading first principal component, where the direction of the component appears to be one direction but is in fact another. The means at the sub-county and county levels also seem to be quite far apart, which is another indication that the principal components resulting from PCA on the raw data could be misleading. Therefore, for the data at both the county and the sub-county level, it is best practice to both center and scale: centering is good practice when means differ vastly, while scaling controls the standard deviation and accounts for the large impact of unequal variances. Now we can run the principal component analysis:
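Before running PCA on the census data, the effect of scaling can be checked on a toy example (a sketch with simulated, made-up data, not part of the analysis): without scaling, a column with a huge variance, like Income, dominates the first principal component.

```r
# Sketch with simulated data: one column with variance ~1e8 (like Income)
# and one with variance ~1 (like a percentage attribute).
set.seed(1)
toy <- data.frame(income = rnorm(100, mean = 50000, sd = 10000),
                  pct    = rnorm(100, mean = 50,    sd = 1))

unscaled <- prcomp(toy, center = TRUE, scale = FALSE)
scaled   <- prcomp(toy, center = TRUE, scale = TRUE)

# Without scaling, PC1 is almost entirely the high-variance column:
abs(unscaled$rotation["income", 1])   # close to 1
# With scaling, both variables contribute comparably to the loadings:
abs(scaled$rotation[, 1])
```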

In [59]:
# PCA on county-level data, keeping the first two principal components
county_model <- prcomp(subset(census.ct, select = -c(State, County)), center = TRUE, scale = TRUE)
ct.pc <- data.frame(county_model$rotation[, 1:2])

# Sub-county PCA, keeping the first two principal components.
subct_model <- prcomp(subset(census.subct, select = -c(State, County, TotalPop, Weight)), center = TRUE, scale = TRUE)
subct.pc <- data.frame(subct_model$rotation[, 1:2])

# Grabbing the three features with the largest absolute values for PC1 in both the county and sub-county models.
sort(abs(county_model$rotation[, 1]), decreasing = TRUE)[1:3]
sort(abs(subct_model$rotation[, 1]), decreasing = TRUE)[1:3]

# Checking the loading vectors for the PCA models.
#data.frame(county_model$rotation[, 1])
#data.frame(subct_model$rotation[, 1])
IncomePerCap
0.353076716077694
ChildPoverty
0.342153045598442
Poverty
0.340583243369814
IncomePerCap
0.318154894206423
Professional
0.306307909270984
Poverty
0.304367450642174

The 3 features with the largest absolute values in the first principal component at the county level are Poverty, ChildPoverty, and IncomePerCap, and at the sub-county level they are IncomePerCap, Professional, and Poverty. We observe that at the county level the features with opposite signs are Men, White, Citizen, Poverty, ChildPoverty, Service, Drive, Transit, OtherTransp, PrivateWork, FamilyWork, Unemployment, Minority, and CountyTotal. At the sub-county level the features with opposite signs are Men, Women, White, Citizen, Office, Production, Drive, WorkAtHome, MeanCommute, SelfEmployed, FamilyWork, Unemployment, and Minority. If a feature has an opposite sign, this does not change the variance captured by the first component, since the weights change sign as well, and the interpretation therefore remains the same. To see why, imagine a biplot rotated 180 degrees: the relation between the loadings and the data points is exactly the same, so the interpretation is identical. The signs of PCA scores and loadings are arbitrary and can be flipped, but only if the signs of both the scores and the loadings are changed at the same time.
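The sign-flip argument can be verified directly with a small simulated example (a sketch, not part of the census analysis): negating both the scores and the loadings of a component leaves the reconstruction of the data, and hence the interpretation, unchanged.

```r
# Sketch with simulated data: flipping the sign of a component's scores and
# loadings together does not change the data they reconstruct.
set.seed(2)
X  <- matrix(rnorm(40), nrow = 10, ncol = 4)
pc <- prcomp(X, center = TRUE, scale = TRUE)

recon <- pc$x %*% t(pc$rotation)   # reconstruction of the standardized data

# Flip the sign of PC1's scores and loadings simultaneously.
x_flip   <- pc$x;        x_flip[, 1]   <- -x_flip[, 1]
rot_flip <- pc$rotation; rot_flip[, 1] <- -rot_flip[, 1]
recon_flipped <- x_flip %*% t(rot_flip)

max(abs(recon - recon_flipped))    # effectively zero
```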

Minimum number of Principal Components and Proportion of Variance Explained

We can determine the minimum number of PCs needed to capture 90% of the variance for both the county and sub-county analyses. Further, we can visualize the proportion of variance explained (PVE) and the cumulative PVE for both analyses as well.

In [60]:
ct.PVE <- county_model$sdev^2/sum(county_model$sdev^2)
# Calculating cumulative PVE for county-level data.
Cum_PVE <- cumsum(ct.PVE)

# Calculating PVE for subcounty-level data.
subct.PVE <- subct_model$sdev^2/sum(subct_model$sdev^2)

# Calculating cumulative PVE for subcounty-level data.
Cum_PVE1 <- cumsum(subct.PVE)
options(repr.plot.width = 30, repr.plot.height = 30, repr.plot.res = 100)

par(mfrow=c(2, 2))
plot(ct.PVE, type="l", lwd=3, main="PVE for County",cex.axis=3,cex.lab=2,cex.main=4)
plot(Cum_PVE, type="l", lwd=3,main=" Cumulative PVE for County",cex.axis=3,cex.lab=2, cex.main=4)
plot(subct.PVE, type="l", lwd=3, main="PVE for Subcounty",cex.axis=3,cex.lab=2,cex.main=4)
plot(Cum_PVE1, type="l", lwd=3,main=" Cumulative PVE for Subcounty",cex.axis=3,cex.lab=2,cex.main=4)


# Finding number of principal components to explain at least 90% of the variation for county-level data.
which(Cum_PVE == min(Cum_PVE[Cum_PVE >= .9]))


# Finding number of principal components to explain at least 90% of the variation for subcounty-level data.
which(Cum_PVE1 == min(Cum_PVE1[Cum_PVE1 >= .9]))
14
15

For the county level, it takes 14 PCs to capture 90% of the variance.

For the sub-county level, it takes 15 PCs to capture 90% of the variance.
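Since the cumulative PVE is non-decreasing, the minimum number of components is simply the first index at which it crosses the threshold, which gives a slightly more direct version of the lookup used above. A sketch on a made-up PVE vector:

```r
# Hypothetical PVE vector for illustration (sums to 1)
pve <- c(0.60, 0.20, 0.15, 0.05)
cum_pve <- cumsum(pve)

# First component index whose cumulative PVE reaches 90%
min_pcs <- which(cum_pve >= 0.9)[1]
min_pcs   # 3, since 0.60 + 0.20 + 0.15 = 0.95
```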

Hierarchical clustering with complete linkage: PCA vs. non-PCA

With census.ct we can perform hierarchical clustering with complete linkage. We first run hierarchical clustering on the original features, then re-run the algorithm using the first 5 principal components of ct.pc as inputs instead. In each case we cut the tree to partition the observations into 10 clusters. We can gauge the effectiveness of the two approaches by assessing which one better places a randomly chosen county; here we use San Mateo as our indicator county.

In [61]:
oldw <- getOption("warn")
options(warn = -1)

ct_dist=dist(census.ct,method = "euclidean")
set.seed(4)

cluster=hclust(ct_dist^(1/4), method="complete")
dend.ct=as.dendrogram(cluster)
dend.ct=color_branches(dend.ct,k=10)
dend.ct=color_labels(dend.ct,k=10)
dend.ct=set(dend.ct,"labels_cex",0.85)
dend.ct=set_labels(dend.ct, labels=census.ct$Type[order.dendrogram(dend.ct)])

plot(dend.ct,horiz=T,main = "Dendrogram of census.ct Colored with 10 Clusters",cex.main=3,cex.axis=2)

options(warn = oldw)

Now we can visualize the second dendrogram:

In [63]:
ct.pc1 =(county_model$x[,1:5])
ct.dist=dist(ct.pc1,method = "euclidean")

cluster1=hclust(ct.dist, method="complete")
dend.ct=as.dendrogram(cluster1)
dend.ct=color_branches(dend.ct,k=10)
dend.ct=color_labels(dend.ct,k=10)
dend.ct=set(dend.ct,"labels_cex",0.85)
dend.ct=set_labels(dend.ct, labels=census.ct$Type[order.dendrogram(dend.ct)])
plot(dend.ct,horiz=T,main = "Dendrogram of ct.pc Colored with 10 Clusters",cex.main=3,cex.axis=2)

SM<-cutree(cluster,10)
SMclust<-SM[which(census.ct$County=="San Mateo")] 

SM1<-cutree(cluster1,10)
SMclust1<-SM1[which(census.ct$County=="San Mateo")]
Warning message:
“Unknown or uninitialised column: `Type`.”
Warning message in `labels<-.dendrogram`(`*tmp*`, value = labels):
“The lengths of the new labels is shorter than the number of leaves in the dendrogram - labels are recycled.”
Warning message in rep(new_labels, length.out = leaves_length):
“'x' is NULL so the result will be NULL”
In [64]:
plot(county_model$x, col = SM,  main="Hierarchical Clustering on County", sub="clusters=10", cex.lab = 1.5,cex.main=3,cex.axis=2)
  abline(v = ct.pc[SMclust,1], col = "blue")
In [65]:
plot(ct.pc1 ,col=SM1,
      main="Hierarchical Clustering on County with 5 Principal Components", sub="clusters=10", cex.lab=1.5,cex.main=3,cex.axis=2)
 abline(v = ct.pc[SMclust1,], col = "blue")

Running hierarchical clustering on census.ct places San Mateo county in the 2nd cluster. On the other hand, running hierarchical clustering on the principal components obtained from the principal component analysis places San Mateo in the 7th cluster. Using the principal components yields the better classification of San Mateo. Principal component analysis reduces the data to the attributes that account for the most variance, that is, the features that best distinguish observations. Clustering on the principal components therefore groups counties by the characteristics that most markedly define them, so San Mateo is clustered according to its most characterizing attributes. Clustering on the raw features, by contrast, gives weight to attributes that do not necessarily distinguish San Mateo, so it can end up grouped with counties that are not genuinely similar to it. San Mateo is thus less appropriately clustered when principal components are not used.
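The cluster-lookup pattern used for San Mateo above, distilled: cut the dendrogram into k groups, then index the resulting assignment vector by the observation of interest. A toy sketch with made-up observation names:

```r
set.seed(2)
X <- matrix(rnorm(20), nrow = 10, ncol = 2)
rownames(X) <- paste0("obs", 1:10)   # illustrative names

hc <- hclust(dist(X, method = "euclidean"), method = "complete")
cl <- cutree(hc, k = 3)              # partition into 3 clusters

# Which cluster did a particular observation land in?
cl["obs4"]
```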

Classification

Our data is now ready for classification, so we begin by fitting a decision tree and pruning it to minimize misclassification error, using generated folds for cross-validation. We then visualize the trees before and after pruning, assess the training and test errors, and save these to a records variable.

In [66]:
tmpwinner <- county_winner %>% ungroup %>%
  mutate(state = state.name[match(state, state.abb)]) %>%               ## state abbreviations
  mutate_at(vars(state, county), tolower) %>%                           ## to all lowercase
  mutate(county = gsub(" county| columbia| city| parish", "", county))  ## remove suffixes
tmpcensus <- census.ct %>% mutate_at(vars(State, County), tolower)

election.cl <- tmpwinner %>%
  left_join(tmpcensus, by = c("state"="State", "county"="County")) %>% 
  na.omit

## save meta information
election.meta <- election.cl %>% select(c(county, fips, state, votes, pct, total))

## save predictors and class labels
election.cl = election.cl %>% select(-c(county, fips, state, votes, pct, total))

set.seed(10) 
n <- nrow(election.cl)
in.trn <- sample.int(n, 0.8*n) 
trn.cl <- election.cl[ in.trn,]
tst.cl <- election.cl[-in.trn,]

set.seed(20) 
nfold <- 10
folds <- sample(cut(1:nrow(trn.cl), breaks=nfold, labels=FALSE))

calc_error_rate = function(predicted.value, true.value){
  return(mean(true.value!=predicted.value))
}
records = matrix(NA, nrow=4, ncol=2)
colnames(records) = c("train.error","test.error")
rownames(records) = c("tree","logistic","lasso","Knn")


#Fit Tree model
tree=tree(candidate~.,data=trn.cl)
draw.tree(tree, nodeinfo = TRUE,cex=1)
In [67]:
set.seed(102)
# cross validation
cvtree=cv.tree(tree,FUN=prune.misclass, K=10,rand=folds)
#determine best size
best.size.cv=cvtree$size[max(which(cvtree$dev==min(cvtree$dev)))]


#Prune Tree to a tree with only 10 leaf nodes
prune=prune.tree(tree,best=best.size.cv, method = "misclass")
draw.tree(prune, nodeinfo = TRUE,cex=1.2)


#Training Error

set.seed(98)
pred.prune.train = predict(prune,trn.cl,type="class")
records[1,1]<-calc_error_rate(pred.prune.train, trn.cl$candidate)

#Test Error
pred.prune.test=predict(prune,tst.cl,type="class")
records[1,2]<-calc_error_rate(pred.prune.test, tst.cl$candidate)

When comparing the two, we see how many more nodes the unpruned tree possesses than the pruned tree: 27 total nodes versus 13, roughly half. Pruning the tree reduces the risk of overfitting while retaining as much information as possible; the total percentage classified correctly only decreased from 93.7% to 93.4%. After pruning, our decision tree is much easier to understand and lets us tell a clear story that the audience can follow.

In the pruned tree, we observe that Donald Trump dominated in counties with larger White populations and where people do not take public transportation, while Clinton, although struggling overall, led Trump in counties with lower income, more public transit use, and in larger counties.

Prediction of winning candidate

We run a logistic regression to predict the winning candidate in each county, save the training and test errors into the same records object used in the last section, and assess the results.

In [68]:
oldw <- getOption("warn")
options(warn = -1)






trn.cl$candidate <- factor(trn.cl$candidate, levels=c("Donald Trump", "Hillary Clinton"))
tst.cl$candidate <- factor(tst.cl$candidate, levels=c("Donald Trump", "Hillary Clinton"))
glm.fit=glm(candidate~.,data=trn.cl,family=binomial)

print(summary(glm.fit))

#Training Error
prob.training = round(predict(glm.fit, type="response"),digits=5)
trn.cl1 = trn.cl %>%
  mutate(predcand=as.factor(ifelse(prob.training<=0.5, "Donald Trump", "Hillary Clinton")))

records[2,1]<-calc_error_rate(trn.cl1$predcand,trn.cl$candidate)

#Test Error
prob.test = round(predict(glm.fit, tst.cl, type="response"),digits=5)
tst.cl1 = tst.cl %>%
  mutate(predcand=as.factor(ifelse(prob.test<=0.5, "Donald Trump", "Hillary Clinton")))

records[2,2]<-calc_error_rate(tst.cl1$predcand,tst.cl$candidate)

options(warn = oldw)
Call:
glm(formula = candidate ~ ., family = binomial, data = trn.cl)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.9948  -0.2600  -0.1104  -0.0399   3.5372  

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -2.487e+01  9.452e+00  -2.631  0.00850 ** 
Men              9.589e-02  4.824e-02   1.988  0.04682 *  
White           -1.689e-01  6.763e-02  -2.497  0.01253 *  
Citizen          1.302e-01  2.796e-02   4.657 3.20e-06 ***
Income          -8.708e-05  2.729e-05  -3.192  0.00142 ** 
IncomeErr       -3.326e-06  6.365e-05  -0.052  0.95833    
IncomePerCap     2.671e-04  6.699e-05   3.987 6.70e-05 ***
IncomePerCapErr -3.605e-04  1.715e-04  -2.102  0.03560 *  
Poverty          4.741e-02  4.052e-02   1.170  0.24193    
ChildPoverty    -1.582e-02  2.458e-02  -0.644  0.51971    
Professional     2.802e-01  3.870e-02   7.240 4.48e-13 ***
Service          3.242e-01  4.758e-02   6.814 9.46e-12 ***
Office           7.590e-02  4.465e-02   1.700  0.08917 .  
Production       1.668e-01  4.093e-02   4.076 4.58e-05 ***
Drive           -2.097e-01  4.617e-02  -4.542 5.58e-06 ***
Carpool         -1.736e-01  5.929e-02  -2.928  0.00342 ** 
Transit          7.578e-02  9.424e-02   0.804  0.42134    
OtherTransp     -6.258e-02  9.503e-02  -0.659  0.51021    
WorkAtHome      -1.657e-01  7.198e-02  -2.302  0.02133 *  
MeanCommute      5.616e-02  2.422e-02   2.319  0.02039 *  
Employed         2.056e-01  3.354e-02   6.132 8.69e-10 ***
PrivateWork      1.010e-01  2.189e-02   4.615 3.93e-06 ***
SelfEmployed     1.968e-02  4.678e-02   0.421  0.67394    
FamilyWork      -8.873e-01  3.765e-01  -2.357  0.01844 *  
Unemployment     2.073e-01  3.971e-02   5.219 1.80e-07 ***
Minority        -3.053e-02  6.509e-02  -0.469  0.63902    
CountyTotal      3.511e-07  4.299e-07   0.817  0.41405    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2119.56  on 2455  degrees of freedom
Residual deviance:  854.47  on 2429  degrees of freedom
AIC: 908.47

Number of Fisher Scoring iterations: 7

Among the significant variables are White, IncomePerCapErr, WorkAtHome, MeanCommute, and FamilyWork.

The logistic regression is not consistent with the decision tree analysis, as we find different significant variables. For every one-unit change in White, the log odds of winning the county (vs. not winning it) decrease by 1.689e-01, holding the other variables fixed. For every one-unit change in IncomePerCapErr, the log odds of winning the county decrease by 3.605e-04, holding the other variables fixed.
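A log-odds coefficient is easier to read after exponentiating it into an odds ratio; this is a standard transformation, applied here to the White coefficient reported in the summary above:

```r
# Coefficient for White from the summary (log-odds scale)
b_white <- -1.689e-01

# Each one-unit increase in White multiplies the odds of a
# Clinton win by exp(b_white), holding other variables fixed
or_white <- exp(b_white)
round(or_white, 3)   # 0.845, i.e. roughly a 15.5% drop in the odds
```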

Finding the optimal lambda and LASSO Regression

Assessing our logistic regression fit (glm.fit), we see that the fitted probabilities are at times numerically 0 or 1. This indicates perfect separation (some linear combination of variables perfectly predicts the winner), which is usually a sign of overfitting. One way to control overfitting in logistic regression is regularization. We can use the cv.glmnet function from the glmnet library to run K-fold cross-validation and select the best regularization parameter for the logistic regression with a LASSO penalty. We set alpha=1 to run LASSO regression and pass lambda = c(1, 5, 10, 50) * 1e-4 to cv.glmnet() as pre-defined candidate values for the tuning parameter $\lambda$, because the default candidate values for $\lambda$ in cv.glmnet() are too large for our dataset. We save the training and test errors to the records variable.

In [69]:
x <- model.matrix(candidate~.,trn.cl)[,-1]
y <- ifelse(trn.cl$candidate == "Hillary Clinton",1,0)
cvlasso<-cv.glmnet(x,y, alpha = 1, lambda = c(1, 5, 10, 50) * 1e-4)
cvlasso

lambda_glmfit = glmnet(x,y,family = "binomial", alpha = 1, lambda=cvlasso$lambda.1se)
coef(lambda_glmfit)
Call:  cv.glmnet(x = x, y = y, lambda = c(1, 5, 10, 50) * 1e-04, alpha = 1) 

Measure: Mean-Squared Error 

    Lambda Measure       SE Nonzero
min  5e-04 0.07201 0.003601      25
1se  5e-03 0.07316 0.003656      18
27 x 1 sparse Matrix of class "dgCMatrix"
                           s0
(Intercept)     -2.180620e+01
Men              .           
White           -9.626087e-02
Citizen          1.259987e-01
Income           .           
IncomeErr        .           
IncomePerCap     2.527333e-05
IncomePerCapErr  .           
Poverty          3.662372e-02
ChildPoverty     .           
Professional     1.461600e-01
Service          1.588740e-01
Office           .           
Production       .           
Drive           -5.133751e-02
Carpool         -1.085224e-02
Transit          1.595093e-01
OtherTransp      3.218219e-02
WorkAtHome       .           
MeanCommute      .           
Employed         1.375509e-01
PrivateWork      7.146414e-02
SelfEmployed     .           
FamilyWork      -2.821037e-01
Unemployment     1.411764e-01
Minority         .           
CountyTotal      4.274689e-07

The optimal lambda is 0.005 (the lambda.1se value), because it gives a much simpler model than lambda.min while still maintaining a comparable level of accuracy.

The nonzero coefficients for the penalized fit are White, Citizen, IncomePerCap, Poverty, Professional, Service, Drive, Carpool, Transit, OtherTransp, Employed, PrivateWork, FamilyWork, Unemployment, and CountyTotal. This contrasts with the unpenalized fit, in which all 26 predictors have nonzero coefficients.
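The nonzero set can also be extracted programmatically instead of being read off the printed matrix. A hedged sketch on simulated data (not the election data), using the same glmnet interface:

```r
library(glmnet)

set.seed(3)
n <- 100; p <- 8
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("x", 1:p)))
# Only x1 and x2 truly matter in this simulation
y <- 3 * X[, 1] - 2 * X[, 2] + rnorm(n)

# LASSO at a fixed lambda large enough to zero out the noise features
fit <- glmnet(X, y, alpha = 1, lambda = 0.5)

# Names of predictors with nonzero coefficients (drop the intercept)
b <- coef(fit)[-1, 1]
names(b)[b != 0]
```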

In [70]:
#LASSO FIT
#Training Error
prob.training1 = round(predict(lambda_glmfit,x, type="response"),digits=5)
trn.cl2 = trn.cl %>%
mutate(predcand=as.factor(ifelse(prob.training1<=0.5, "Donald Trump", "Hillary Clinton")))
records[3,1]<-calc_error_rate(trn.cl2$predcand,trn.cl$candidate)
#Test Error
x.test <- model.matrix(candidate~.,tst.cl)[,-1]
prob.test1 = round(predict(lambda_glmfit, x.test, type="response"),digits=5)
tst.cl2 = tst.cl %>%
mutate(predcand=as.factor(ifelse(prob.test1<=0.5, "Donald Trump", "Hillary Clinton")))
records[3,2]<-calc_error_rate(tst.cl2$predcand,tst.cl$candidate)
colnames(records) = c("train.error","test.error")
rownames(records) = c("tree","logistic","lasso","KNN")
records_final<-records[-4,]
records_final
dim(prob.training1)
A matrix: 3 × 2 of type dbl

          train.error  test.error
tree       0.06555375  0.06504065
logistic   0.07043974  0.06341463
lasso      0.07247557  0.06991870
  1. 2456
  2. 1

ROC curves for the decision tree, logistic regression and LASSO logistic regression

We plot the ROC curves for the decision tree, logistic regression and LASSO logistic regression using predictions on the test data.

In [71]:
#Predict using the pruned tree fit. The tree was grown before the candidate
#factor was releveled, so its probability matrix has a column per original
#level; select the "Hillary Clinton" column explicitly rather than dropping
#the first column, otherwise prediction() receives a multi-column matrix.
pred.prune.test1=predict(prune,tst.cl,type="vector")[,"Hillary Clinton"]
#Predict using logistic regression fit
prob.test = round(predict(glm.fit, tst.cl, type="response"),digits=5)
#Predict using LASSO logistic regression fit
prob.test1 = round(predict(lambda_glmfit, x.test, type="response"),digits=5)
#Constructing ROC Curves
pred = prediction(pred.prune.test1, tst.cl[["candidate"]])
perf = performance(pred, measure="tpr", x.measure="fpr")
pred1 = prediction(prob.test,tst.cl$candidate)
perf1 = performance(pred1, measure="tpr", x.measure="fpr")
pred2 = prediction(prob.test1,tst.cl$candidate)
perf2 = performance(pred2, measure="tpr", x.measure="fpr")
#Plotting all curves
plot(perf, col=2, lwd=3, main="ROC curves for Logistic Regression Fit, Lasso Logistic Regression Fit and Tree Fit")
abline(0,1)
par(new=TRUE)
plot(perf1, col=4, lwd=3)
par(new=TRUE)
plot(perf2, col=3, lwd=3)
legend(0.7,0.3,legend=c("Tree Fit","Logistic Regression Fit","Lasso Logistic Regression Fit"),col=c(2,4,3),lty=1,cex=0.6)
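The visual ranking of the curves can be quantified with AUC, which ROCR exposes as performance(pred, "auc"). The same number can be computed in base R via the rank-sum (Mann-Whitney) identity; a small sketch on toy scores:

```r
# Rank-based AUC: the probability that a randomly chosen positive
# is scored above a randomly chosen negative (ties get half credit)
auc <- function(scores, labels) {
  r  <- rank(scores)
  n1 <- sum(labels == 1)   # number of positives
  n0 <- sum(labels == 0)   # number of negatives
  (sum(r[labels == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

labels <- c(0, 0, 1, 1, 0, 1)
scores <- c(0.10, 0.40, 0.35, 0.80, 0.20, 0.70)
auc(scores, labels)   # 8/9: one negative outranks one positive
```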

Decision trees are non-parametric models that are generally very easy to interpret. Their graphs are intuitive, and they are not sensitive to monotone transformations of the predictors. However, their predictive accuracy is not competitive with the best supervised learning approaches, and they tend to overfit, with very high variance: a small change in the data can lead to a very different tree.

Logistic regression is favored because it is very efficient, requires no tuning, and works better when you remove attributes that are unrelated to the output or correlated with each other. It represents a good baseline against which to analyze the performance of more complex models. However, logistic regression cannot solve non-linear problems, since it has a linear decision surface; it gets outperformed by more complex algorithms and is only a useful tool once all the important independent variables have been identified.

LASSO logistic regression is an automatic method that selects features by shrinking coefficients toward zero, performing variable selection and regularization to improve the prediction accuracy and interpretability of the model. It often outperforms other automatic variable selection methods, but it can produce models that do not make sense, ignore important variables, and it does not follow the hierarchy principle.

The different classifiers cater to specific questions, and each possesses its own strengths and purpose: decision trees let audiences easily follow a "story" but are not as powerful as other models for prediction. The data does not change, but trying different classifiers for the question at hand helps achieve the most accurate results.

Reflection on analysis findings

As we found when conducting research for this analysis, elections are extremely difficult to predict accurately, because small changes and errors can result in a completely incorrect prediction. Elections involve a significant number of factors on which predictions can be based, but sentiment can quickly shift from one candidate to another, requiring models to weight these factors carefully. This research project allowed me and my partner to analyze which factors in particular hold the most weight in producing the best possible predictions.

In the visualizations of the poverty percentage for each state, we hoped to see whether there were any trends with respect to the Black and White populations of each state. By examining the White population and poverty levels, we can connect them with our decision tree analysis to get an idea of which states, and their respective counties, Trump or Clinton will win. In states with a higher White population, we can expect Trump to win many counties, given the White population percentages observed and the splits in the decision tree.

Of the three methods explored, we determined that logistic regression worked best, as it possessed the highest true positive rate and the lowest test error compared to LASSO logistic regression and decision trees. From our ROC curves we observed that the tree fit has the lowest AUC value, as it sits below the logistic regression fit and the LASSO logistic regression fit. The logistic regression and LASSO logistic regression curves are very similar, but given the lower error of plain logistic regression, we favor it over the LASSO fit. Also, since we removed unnecessary and unrelated attributes, the logistic regression method can work efficiently relative to the other methods.

We could improve our analysis by incorporating data from 2008 and 2012 to see what discrepancies there may be between the three elections. The extra data would also improve our models by testing whether the factors from the 2016 election held in the 2008 and 2012 elections as well. If they differed, we would discover that the 2016 election may have been a unique circumstance, in which case our model would not be accurate for predicting a "normal" election. Data from other Senate elections could also help, as we could examine voting behavior by party affiliation and check whether the same factors hold similar weights as in the presidential election.

Further expansion of analysis using KNN

In [72]:
library(class)
library(reshape2)
#exploring additional classification methods:

#knn

Ytrain <- trn.cl$candidate
Xtrain <- trn.cl %>% select(-candidate) 

Ytest <- tst.cl$candidate
Xtest <- tst.cl %>% select(-candidate)


# do.chunk() for k-fold cross-validation
do.chunk <- function(chunkid, folddef, Xdat, Ydat, ...){ # Function arguments
  train = (folddef != chunkid)  # Get training index
  Xtr = Xdat[train,]            # Training set by the above index
  Ytr = Ydat[train]             # True labels in training set
  Xvl = Xdat[!train,]           # Validation set
  Yvl = Ydat[!train]            # True labels in validation set
  predYtr = knn(train=Xtr, test=Xtr, cl=Ytr, ...)  # Predict training labels
  predYvl = knn(train=Xtr, test=Xvl, cl=Ytr, ...)  # Predict validation labels
  data.frame(fold = chunkid,                       # k folds
             train.error = mean(predYtr != Ytr),   # Training error for this fold
             val.error = mean(predYvl != Yvl))     # Validation error for this fold
}

# creating a vector of possible k values
kvec <- c(1, seq(10, 50, length.out=9))
kerrors <- NULL


#cross validation
for (j in kvec) {
  tve <- plyr::ldply(1:nfold, do.chunk, folddef=folds,
                     Xdat=Xtrain, Ydat=Ytrain, k=j)
  tve$neighbors <- j
  kerrors <- rbind(kerrors, tve)
}
# calculating validation errors at each k
errors <- melt(kerrors, id.vars=c("fold","neighbors"), value.name="error")
val.error.means <- errors %>%
  filter(variable=="val.error") %>%
  group_by(neighbors) %>%
  summarise_at(vars(error), funs(mean))
# picking the best k
min.error <- val.error.means %>%
  filter(error==min(error))
bestk <- max(min.error$neighbors)
bestk

# calculating training errors at each k
# (by taking mean of each cv result)
train.error.means <- errors %>%
  filter(variable=="train.error") %>%
  group_by(neighbors) %>%
  summarise_at(vars(error), funs(mean))

pred.Ytrain <- knn(train = Xtrain, test = Xtrain, cl = Ytrain, k = bestk)
records[4,1] <- calc_error_rate(pred.Ytrain, Ytrain)

pred.Ytrain <- knn(train=Xtrain, test=Xtest, cl=Ytrain, k=bestk)
records[4,2] <- calc_error_rate(pred.Ytrain, Ytest)
colnames(records) = c("train.error","test.error")
rownames(records) = c("tree","logistic","lasso", "KNN")
records
Attaching package: ‘reshape2’


The following object is masked from ‘package:tidyr’:

    smiths


25
A matrix: 4 × 2 of type dbl

          train.error  test.error
tree       0.06555375  0.06504065
logistic   0.07043974  0.06341463
lasso      0.07247557  0.06991870
KNN        0.11644951  0.11544715

We observe that the best k in this instance is k=25. We output the records object, to which we appended the KNN values. Comparing KNN with the decision tree, logistic regression, and LASSO logistic regression, KNN possesses both the largest training error and the largest test error of the four. The decision tree has the lowest training error, while logistic regression has the lowest test error. KNN was outperformed partly because it is nonparametric and makes no assumptions about the shape of the decision boundary; it only beats logistic regression when the decision boundary is highly non-linear, which does not appear to be the case here. In addition, KNN does not take into account which predictors are important. Logistic regression and the decision tree have similar errors: the tree has a slightly smaller training error, and logistic regression owns a smaller test error by about 0.002. The tree nonetheless remains attractive because decision trees are easy for the audience to interpret.
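The role of k in class::knn (the same function used above) can be seen on a tiny made-up example; the points and labels here are illustrative only:

```r
library(class)

# Two well-separated groups in 2-D
train  <- rbind(c(0, 0), c(0, 1), c(10, 10), c(10, 11))
labels <- factor(c("a", "a", "b", "b"))
test   <- rbind(c(0, 0.5))

# k = 1 uses the single nearest point; k = 3 takes a majority
# vote, which can smooth over a noisy nearest neighbor
knn(train, test, labels, k = 1)   # "a"
knn(train, test, labels, k = 3)   # still "a": neighbors are a, a, b
```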